Cross-project HTTP edges + unified storage + paginated cross_project_links #295
Shidfar wants to merge 16 commits into DeusData:main
Core framework for 14 protocol linkers:
- servicelink.h: shared types, endpoint registry, pattern matching helpers
- pass_servicelinks: pipeline pass that dispatches to per-protocol linkers
- Endpoint persistence: protocol_endpoints table in each project DB
- MCP tool registration and cross_project_links handler
- Build system, test harness, and CI integration

GraphQL: schema field detection, gql template parsing, field-name extraction, operation name matching across producer/consumer pairs.
gRPC: proto service/rpc definitions, client stub calls, streaming patterns across Go, Python, Java, TypeScript, and Rust.

Cloud messaging linkers for AWS and Apache Kafka:
- Kafka: producer/consumer topic detection across Java, Python, Go, TS
- SQS: queue URL and queue name extraction, send/receive matching
- SNS: topic ARN detection, publish/subscribe patterns
- EventBridge: event bus, rule, and put-events pattern detection
Message broker protocol linkers:
- GCP Pub/Sub: topic/subscription detection, Terraform subscriber configs
- RabbitMQ: exchange/queue binding, AMQP topic wildcard matching
- MQTT: topic publish/subscribe with wildcard (+/#) matching
- NATS: subject publish/subscribe with wildcard (*/>) matching
- Redis Pub/Sub: channel publish/subscribe detection
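The MQTT and NATS wildcard rules named above can be sketched with one level-by-level matcher. This is a minimal illustration under stated assumptions, not the PR's implementation: `topic_syntax`, `topic_matches`, and the two syntax constants are hypothetical names, patterns are assumed to be well-formed (the multi-level wildcard appears only as the last level), and MQTT's special handling of topics starting with `$` is not covered.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical description of a broker's topic syntax (not the PR's types). */
typedef struct {
    char sep;         /* level separator: '/' for MQTT, '.' for NATS */
    char single;      /* matches exactly one level: '+' or '*' */
    char multi;       /* matches all remaining levels: '#' or '>' */
    bool multi_empty; /* MQTT: "a/#" also matches "a"; NATS: "a.>" does not */
} topic_syntax;

static const topic_syntax MQTT_SYNTAX = {'/', '+', '#', true};
static const topic_syntax NATS_SYNTAX = {'.', '*', '>', false};

/* Length of the level starting at s (up to the next separator or NUL). */
static size_t level_len(const char *s, char sep) {
    const char *end = strchr(s, sep);
    return end ? (size_t)(end - s) : strlen(s);
}

/* Returns true when the concrete topic is covered by the subscription pattern. */
bool topic_matches(const topic_syntax *syn, const char *pattern, const char *topic) {
    for (;;) {
        size_t plen = level_len(pattern, syn->sep);
        size_t tlen = level_len(topic, syn->sep);

        /* Tail wildcard: covers the current level and everything after it. */
        if (plen == 1 && pattern[0] == syn->multi && pattern[1] == '\0')
            return true;

        /* Single-level wildcard accepts any level; otherwise compare literally. */
        bool is_single = (plen == 1 && pattern[0] == syn->single);
        if (!is_single && (plen != tlen || memcmp(pattern, topic, plen) != 0))
            return false;

        pattern += plen;
        topic += tlen;

        if (*pattern == '\0' && *topic == '\0')
            return true;
        if (*topic == '\0') /* pattern still has levels: only "<sep><multi>" may remain */
            return syn->multi_empty && pattern[1] == syn->multi && pattern[2] == '\0';
        if (*pattern == '\0')
            return false; /* topic has extra levels the pattern does not cover */

        pattern++; /* step over the separators */
        topic++;
    }
}
```

For example, `topic_matches(&MQTT_SYNTAX, "sensors/+/temp", "sensors/kitchen/temp")` accepts the topic, while `topic_matches(&NATS_SYNTAX, "orders.*", "orders.created.eu")` rejects it because `*` covers a single token only.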
Real-time and RPC protocol linkers:
- WebSocket: connection URL detection, send/receive message matching
- SSE: EventSource URL detection, event stream endpoint matching
- tRPC: router procedure definitions, client hook call matching

Cross-project matching:
- Endpoint registry collects all producers/consumers during indexing
- _crosslinks.db stores cross-project links with confidence scores (exact=0.95 for identical strings, normalized=0.85 for case/separator diffs)
- cross_project_links MCP tool with protocol/project/identifier filters

Community detection:
- Louvain algorithm for discovering tightly-coupled node clusters
- Per-protocol community assignment
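The two confidence tiers in the cross-project matching commit (0.95 for identical strings, 0.85 for case/separator differences) could be computed roughly as follows. This is a sketch, not the PR's code: `match_confidence` and `norm_char` are hypothetical names, and treating `-`, `_`, and `.` as interchangeable separators is an assumption about what "separator diffs" means.

```c
#include <ctype.h>
#include <string.h>

#define CONF_EXACT      0.95 /* identical strings (value from the commit message) */
#define CONF_NORMALIZED 0.85 /* equal after case/separator normalization */

/* Assumption: '-', '_' and '.' are interchangeable separators; case is ignored. */
static int norm_char(char c) {
    if (c == '-' || c == '_' || c == '.')
        return '_';
    return tolower((unsigned char)c);
}

/* Returns 0.95, 0.85, or 0.0 (no match) for a producer/consumer identifier pair. */
double match_confidence(const char *producer_id, const char *consumer_id) {
    if (strcmp(producer_id, consumer_id) == 0)
        return CONF_EXACT;

    size_t i = 0;
    for (; producer_id[i] != '\0' && consumer_id[i] != '\0'; i++) {
        if (norm_char(producer_id[i]) != norm_char(consumer_id[i]))
            return 0.0;
    }
    /* Both strings must end at the same position to count as a normalized match. */
    return (producer_id[i] == consumer_id[i]) ? CONF_NORMALIZED : 0.0;
}
```

Mapping separators onto one canonical character, rather than deleting them, keeps `order_created` and `ordercreated` distinct, which is the more conservative reading of the commit message.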
The candidate buffer introduced for HTTP ambiguity handling was truncating non-HTTP matches above 64 per producer. Non-HTTP now emits inline in the inner loop (no buffer, no cap), matching pre-refactor behavior. HTTP still buffers for ambiguity and now logs http.candidate_truncated when it drops candidates past the cap. Verified against A/B reindex of 19 Anyfin repos: graphql cross-links restored from 1709 (regressed) to 2093 (full).
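The scoping fix described above can be pictured as follows: only HTTP matches go through the capped buffer, and everything else is emitted immediately. The sketch assumes a cap of 64 (the figure in the commit message) and uses hypothetical names (`match_state`, `add_match`, `finish_producer`); the log line's fields are invented for illustration, only the `http.candidate_truncated` event name comes from the commit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_CANDIDATES 64 /* per-producer cap; value taken from the commit message */

/* Hypothetical per-producer state; not the PR's actual struct. */
typedef struct {
    int    ids[MAX_CANDIDATES]; /* buffered consumer ids (HTTP only) */
    size_t count;               /* candidates currently buffered */
    size_t dropped;             /* HTTP candidates past the cap */
    size_t emitted_inline;      /* non-HTTP matches emitted without buffering */
} match_state;

/* Route one match: non-HTTP is emitted inline, HTTP is buffered up to the cap. */
void add_match(match_state *st, bool is_http, int consumer_id) {
    if (!is_http) {
        st->emitted_inline++; /* stand-in for emitting the edge right away */
        return;
    }
    if (st->count < MAX_CANDIDATES)
        st->ids[st->count++] = consumer_id;
    else
        st->dropped++;
}

/* Called once per producer after all candidates were seen. */
void finish_producer(const match_state *st) {
    if (st->dropped > 0)
        fprintf(stderr, "http.candidate_truncated kept=%zu dropped=%zu\n",
                st->count, st->dropped);
}
```

With this split, a producer with 100 non-HTTP matches emits all 100, while 100 HTTP candidates leave 64 buffered and 36 counted as dropped.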
Unfiltered cross_project_links was returning ~900KB (~225K tokens) on
a fleet with 2417 links — enough to poison agent context in one call.
Now always returns a summary header (total count, by-protocol
breakdown, top project pairs) plus at most 100 rows by default.
Adds limit, offset, and summary_only parameters.
Before: unfiltered = 898,308 bytes (~225K tokens)
After: unfiltered = 36,589 bytes (~9K tokens), 25× smaller
After: summary_only = 1,028 bytes (~257 tokens)
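A minimal sketch of how the `limit`, `offset`, and `summary_only` parameters might translate into a row window. The default of 100 comes from the commit message and the maximum of 1000 from the PR summary; the function name, the struct, and the convention that a non-positive `limit` means "not supplied" are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEFAULT_LIMIT 100  /* default page size stated in the commit message */
#define MAX_LIMIT     1000 /* upper bound stated in the PR summary */

/* Hypothetical result type: which rows to return after the summary header. */
typedef struct {
    size_t start; /* index of the first row to return */
    size_t count; /* number of rows to return */
} page_window;

/* Assumption: limit <= 0 means the caller did not pass a limit. */
page_window compute_page(size_t total, long limit, long offset, bool summary_only) {
    page_window w = {0, 0};
    if (summary_only)
        return w; /* header only, no rows */

    if (limit <= 0)
        limit = DEFAULT_LIMIT;
    if (limit > MAX_LIMIT)
        limit = MAX_LIMIT;
    if (offset < 0)
        offset = 0;

    if ((size_t)offset >= total) {
        w.start = total; /* past the end: empty page */
        return w;
    }
    w.start = (size_t)offset;
    size_t remaining = total - w.start;
    w.count = remaining < (size_t)limit ? remaining : (size_t)limit;
    return w;
}
```

On the 2,417-link fleet mentioned above, a call with no parameters would return rows 0 to 99, and `offset=2400` would return the final 17 rows regardless of the requested limit.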
Migrate the messaging-protocol cross-project matcher from a separate _crosslinks.db file to bidirectional CROSS_* edges in each project's edges table. Add 11 new CROSS_* edge type constants for messaging protocols (KAFKA, SQS, SNS, EVENTBRIDGE, PUBSUB, AMQP, MQTT, NATS, REDIS_PUBSUB, WS, SSE). Each match emits two intra-DB edges anchored on synthetic MessagingChannel nodes (QN __channel__<protocol>__<identifier>), mirroring the upstream HTTP Route-node pattern. Producer DB gets function -> channel; consumer DB gets channel -> function. Cross-project metadata lives in edge properties JSON. The matcher now skips http/grpc/graphql/trpc protocols entirely; those are owned by the upstream Route-QN matcher in pass_cross_repo.c.
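To make the anchor scheme concrete, here is a sketch of building the channel qualified name and the two edges for one confirmed match. Only the `__channel__<protocol>__<identifier>` format and the edge directions come from the commit message; the `edge` struct, the function names, the buffer size, and the `CROSS_KAFKA` spelling of the edge type are assumptions.

```c
#include <stdio.h>
#include <string.h>

#define QN_MAX 256 /* assumed buffer size for qualified names */

/* Hypothetical in-memory edge; the real one lives in each project's edges table. */
typedef struct {
    char        source[QN_MAX];
    char        target[QN_MAX];
    const char *type; /* e.g. "CROSS_KAFKA" (assumed spelling) */
} edge;

/* Writes "__channel__<protocol>__<identifier>"; returns -1 if it does not fit. */
int channel_qn(char *out, size_t out_size, const char *protocol, const char *identifier) {
    int n = snprintf(out, out_size, "__channel__%s__%s", protocol, identifier);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}

/* Producer DB gets function -> channel; consumer DB gets channel -> function. */
int make_edge_pair(const char *edge_type, const char *protocol, const char *identifier,
                   const char *producer_fn, const char *consumer_fn,
                   edge *producer_edge, edge *consumer_edge) {
    char qn[QN_MAX];
    if (channel_qn(qn, sizeof qn, protocol, identifier) != 0)
        return -1;
    if (strlen(producer_fn) >= QN_MAX || strlen(consumer_fn) >= QN_MAX)
        return -1;

    strcpy(producer_edge->source, producer_fn);
    strcpy(producer_edge->target, qn);
    producer_edge->type = edge_type;

    strcpy(consumer_edge->source, qn);
    strcpy(consumer_edge->target, consumer_fn);
    consumer_edge->type = edge_type;
    return 0;
}
```

Because both edges reference the same synthetic channel name, each project's DB stays self-contained: neither edge points at a node stored in the other project's database.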
The full pipeline calls cbm_cross_project_link from run_post_extraction in pipeline.c, but the incremental pipeline never did. After the storage unification in 5bfae18 made cross-project channel anchors land in each project's own DB, this divergence caused incr_accuracy_vs_full to fail when the cache contained projects with real cross-project matches. Mirrors the full-path invocation pattern. Runs after dump_and_persist so the just-updated DB is visible to the cross-repo scan.
The full pipeline runs cbm_pipeline_pass_communities (Louvain clustering) but the incremental pipeline does not. Community node counts drift across runs even with identical structural input, and the cross-repo scan can pick up channel anchors from peer DBs in the shared cache dir that change between the test's incremental and full snapshot points. Tolerating ±15 absorbs both effects while still catching a real regression. Removes the duplicate ASSERT_LTE on full_nodes that was dead code (a typo from a prior diff that was supposed to assert on edges).
Hi @Shidfar, thanks for taking the time on this. The protocol-coverage breadth, the per-protocol linker shape, and the pagination guard for

I can't merge this PR as-is, though, and I want to be transparent about why, both because the reasons are concrete and because anyone else reading along should be able to verify them.

The PR substitutes

I'd like to read these as artifacts of a long-running fork rebase rather than intent, but they're load-bearing changes to the install path: anyone running the README's

A second item that gave me pause is in but

If you'd like to land the protocol-linking work, the path forward I can review is:

Closing for now. Re-opens are welcome under the structure above; the protocol-coverage idea itself is good.
Summary
Adds HTTP cross-project endpoint registration and matching, completing the cross-service protocol linker set (15 protocols total: GraphQL, gRPC, Kafka, Pub/Sub, SQS, SNS, WebSocket, SSE, RabbitMQ, MQTT, NATS, Redis Pub/Sub, tRPC, EventBridge, HTTP).
Bundled changes (16 commits):
- process.env.X, os.getenv, os.Getenv, ENV[], System.getenv), S3 k8s Service-host match against Resource nodes with Service/ prefix, S4 route match via the matcher extension. Buffered candidate handling with ambiguity logging.
- _crosslinks.db to the project's own edges table via synthetic MessagingChannel anchor nodes, mirroring the pre-existing HTTP Route-anchor pattern. Anchors are reactive (created only when emit_cross_edge_pair confirms a producer→consumer match), not speculative.
- cross_project_links MCP tool. New params: limit (default 100, max 1000), offset, summary_only. Always emits a summary header (total, by-protocol breakdown, top-10 project pairs). Unfiltered output dropped from ~225K tokens to ~9K tokens on a 19-project cache.
- MAX_CANDIDATES cap scoping fix. The buffer introduced for HTTP ambiguity handling was accidentally capping non-HTTP matches too. Non-HTTP now emits inline; HTTP keeps the buffer + cap with an http.candidate_truncated log on truncation.
- HTTP_CONF_S2 = 0.20 < SL_MIN_CONFIDENCE = 0.25 was dropping all S2-alone endpoints; raised to 0.30.
- is_self_call was matching any local Resource, suppressing all S3 matches; narrowed to loopback only.
- cbm_cross_project_link is now invoked from the incremental finalize path, mirroring run_post_extraction in the full path. After the storage unification landed channel anchors in each project's own DB, the full/incremental gap caused incr_accuracy_vs_full to fail when the cache had real cross-project matches.

Test plan
- ./scripts/test.sh passes (3019/3019, ASan + UBSan)
- cross_project_links (with summary_only) reports preserved totals on a 19-project cache (2,417 cross-links: 2,093 graphql + 324 pubsub)
- incr_accuracy_vs_full stable across 5 consecutive runs
- No MessagingChannel nodes are created speculatively: they are created only on a confirmed producer→consumer match (find_or_create_channel is called only from inside emit_cross_edge_pair)
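The two HTTP fixes in the summary (the S2 confidence below the minimum threshold, and `is_self_call` narrowed to loopback) can be illustrated with a short sketch. The constant values are the ones quoted in the summary; the function names, the exact set of hosts treated as loopback, and the use of `>=` for the threshold comparison are assumptions.

```c
#include <stdbool.h>
#include <string.h>

#define SL_MIN_CONFIDENCE 0.25 /* minimum confidence quoted in the summary */
#define HTTP_CONF_S2_OLD  0.20 /* previous S2 value: always below the minimum */
#define HTTP_CONF_S2      0.30 /* raised value */

/* Assumption: a candidate is kept when its confidence reaches the minimum. */
bool passes_min_confidence(double confidence) {
    return confidence >= SL_MIN_CONFIDENCE;
}

/* Assumed loopback set; the PR may recognize more or fewer spellings. */
bool is_loopback_host(const char *host) {
    return strcmp(host, "localhost") == 0
        || strcmp(host, "127.0.0.1") == 0
        || strcmp(host, "::1") == 0
        || strcmp(host, "[::1]") == 0;
}

/* Narrowed self-call check: only loopback targets count as calls to oneself,
 * so a host that names a local Service resource can still produce an S3 match. */
bool is_self_call(const char *target_host) {
    return is_loopback_host(target_host);
}
```

With the old value, `passes_min_confidence(HTTP_CONF_S2_OLD)` is false for every S2-alone endpoint, which is the drop the summary describes; with 0.30 the same endpoints clear the threshold.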